{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "368686b4-f487-4dd4-aeff-37823976529d",
   "metadata": {},
   "source": [
    "# LlamaCPP \n",
    "\n",
    "In this short notebook, we show how to use the [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) library with LlamaIndex.\n",
    "\n",
    "In this notebook, we use the [`llama-2-chat-13b-ggml`](https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML) model, along with the proper prompt formatting. \n",
    "\n",
    "Note that if you're using a version of `llama-cpp-python` after version `0.1.79`, the model format has changed from `ggmlv3` to `gguf`. Old model files like the used in this notebook can be converted using scripts in the [`llama.cpp`](https://github.com/ggerganov/llama.cpp) repo. Alternatively, you can download the GGUF version of the model above from [huggingface](https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF).\n",
    "\n",
    "By default, if model_path and model_url are blank, the `LlamaCPP` module will load llama2-chat-13B in either format depending on your version.\n",
    "\n",
    "## Installation\n",
    "\n",
    "To get the best performance out of `LlamaCPP`, it is recomended to install the package so that it is compilied with GPU support. A full guide for installing this way is [here](https://github.com/abetlen/llama-cpp-python#installation-with-openblas--cublas--clblast--metal).\n",
    "\n",
    "Full MACOS instructions are also [here](https://llama-cpp-python.readthedocs.io/en/latest/install/macos/).\n",
    "\n",
    "In general:\n",
    "- Use `CuBLAS` if you have CUDA and an NVidia GPU\n",
    "- Use `METAL` if you are running on an M1/M2 MacBook\n",
    "- Use `CLBLAST` if you are running on an AMD/Intel GPU"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "40a33749",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.embeddings import HuggingFaceEmbeddings\n",
    "from llama_index import (\n",
    "    SimpleDirectoryReader,\n",
    "    VectorStoreIndex,\n",
    "    ServiceContext,\n",
    ")\n",
    "from llama_index.llms import LlamaCPP\n",
    "from llama_index.llms.llama_utils import messages_to_prompt, completion_to_prompt"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e7927630-0044-41fb-a8a6-8dc3d2adb608",
   "metadata": {},
   "source": [
    "## Setup LLM\n",
    "\n",
    "The LlamaCPP llm is highly configurable. Depending on the model being used, you'll want to pass in `messages_to_prompt` and `completion_to_prompt` functions to help format the model inputs.\n",
    "\n",
    "Since the default model is llama2-chat, we use the util functions found in [`llama_index.llms.llama_utils`](https://github.com/jerryjliu/llama_index/blob/main/llama_index/llms/llama_utils.py).\n",
    "\n",
    "For any kwargs that need to be passed in during initialization, set them in `model_kwargs`. A full list of available model kwargs is available in the [LlamaCPP docs](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.llama.Llama.__init__).\n",
    "\n",
    "For any kwargs that need to be passed in during inference, you can set them in `generate_kwargs`. See the full list of [generate kwargs here](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.llama.Llama.__call__).\n",
    "\n",
    "In general, the defaults are a great starting point. The example below shows configuration with all defaults.\n",
    "\n",
    "As noted above, we're using the [`llama-2-chat-13b-ggml`](https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML) model in this notebook which uses the `ggmlv3` model format. If you are running a version of `llama-cpp-python` greater than `0.1.79`, you can replace the `model_url` below with `\"https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF/resolve/main/llama-2-13b-chat.Q4_0.gguf\"`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "2640c7a4",
   "metadata": {},
   "outputs": [],
   "source": [
    "model_url = \"https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/llama-2-13b-chat.ggmlv3.q4_0.bin\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "6fa0ec4f-03ff-4e28-957f-b4b99a0faa20",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "llama.cpp: loading model from /Users/rchan/Library/Caches/llama_index/models/llama-2-13b-chat.ggmlv3.q4_0.bin\n",
      "llama_model_load_internal: format     = ggjt v3 (latest)\n",
      "llama_model_load_internal: n_vocab    = 32000\n",
      "llama_model_load_internal: n_ctx      = 3900\n",
      "llama_model_load_internal: n_embd     = 5120\n",
      "llama_model_load_internal: n_mult     = 256\n",
      "llama_model_load_internal: n_head     = 40\n",
      "llama_model_load_internal: n_head_kv  = 40\n",
      "llama_model_load_internal: n_layer    = 40\n",
      "llama_model_load_internal: n_rot      = 128\n",
      "llama_model_load_internal: n_gqa      = 1\n",
      "llama_model_load_internal: rnorm_eps  = 5.0e-06\n",
      "llama_model_load_internal: n_ff       = 13824\n",
      "llama_model_load_internal: freq_base  = 10000.0\n",
      "llama_model_load_internal: freq_scale = 1\n",
      "llama_model_load_internal: ftype      = 2 (mostly Q4_0)\n",
      "llama_model_load_internal: model size = 13B\n",
      "llama_model_load_internal: ggml ctx size =    0.11 MB\n",
      "llama_model_load_internal: mem required  = 6983.72 MB (+ 3046.88 MB per state)\n",
      "llama_new_context_with_model: kv self size  = 3046.88 MB\n",
      "ggml_metal_init: allocating\n",
      "ggml_metal_init: loading '/Users/rchan/opt/miniconda3/envs/llama-index/lib/python3.10/site-packages/llama_cpp/ggml-metal.metal'\n",
      "ggml_metal_init: loaded kernel_add                            0x14ff4f060\n",
      "ggml_metal_init: loaded kernel_add_row                        0x14ff4f2c0\n",
      "ggml_metal_init: loaded kernel_mul                            0x14ff4f520\n",
      "ggml_metal_init: loaded kernel_mul_row                        0x14ff4f780\n",
      "ggml_metal_init: loaded kernel_scale                          0x14ff4f9e0\n",
      "ggml_metal_init: loaded kernel_silu                           0x14ff4fc40\n",
      "ggml_metal_init: loaded kernel_relu                           0x14ff4fea0\n",
      "ggml_metal_init: loaded kernel_gelu                           0x11f7aef50\n",
      "ggml_metal_init: loaded kernel_soft_max                       0x11f7af380\n",
      "ggml_metal_init: loaded kernel_diag_mask_inf                  0x11f7af5e0\n",
      "ggml_metal_init: loaded kernel_get_rows_f16                   0x11f7af840\n",
      "ggml_metal_init: loaded kernel_get_rows_q4_0                  0x11f7afaa0\n",
      "ggml_metal_init: loaded kernel_get_rows_q4_1                  0x13ffba0c0\n",
      "ggml_metal_init: loaded kernel_get_rows_q2_K                  0x13ffba320\n",
      "ggml_metal_init: loaded kernel_get_rows_q3_K                  0x13ffba580\n",
      "ggml_metal_init: loaded kernel_get_rows_q4_K                  0x13ffbaab0\n",
      "ggml_metal_init: loaded kernel_get_rows_q5_K                  0x13ffbaea0\n",
      "ggml_metal_init: loaded kernel_get_rows_q6_K                  0x13ffbb290\n",
      "ggml_metal_init: loaded kernel_rms_norm                       0x13ffbb690\n",
      "ggml_metal_init: loaded kernel_norm                           0x13ffbba80\n",
      "ggml_metal_init: loaded kernel_mul_mat_f16_f32                0x13ffbc070\n",
      "ggml_metal_init: loaded kernel_mul_mat_q4_0_f32               0x13ffbc510\n",
      "ggml_metal_init: loaded kernel_mul_mat_q4_1_f32               0x11f7aff40\n",
      "ggml_metal_init: loaded kernel_mul_mat_q2_K_f32               0x11f7b03e0\n",
      "ggml_metal_init: loaded kernel_mul_mat_q3_K_f32               0x11f7b0880\n",
      "ggml_metal_init: loaded kernel_mul_mat_q4_K_f32               0x11f7b0d20\n",
      "ggml_metal_init: loaded kernel_mul_mat_q5_K_f32               0x11f7b11c0\n",
      "ggml_metal_init: loaded kernel_mul_mat_q6_K_f32               0x11f7b1860\n",
      "ggml_metal_init: loaded kernel_mul_mm_f16_f32                 0x11f7b1d40\n",
      "ggml_metal_init: loaded kernel_mul_mm_q4_0_f32                0x11f7b2220\n",
      "ggml_metal_init: loaded kernel_mul_mm_q4_1_f32                0x11f7b2700\n",
      "ggml_metal_init: loaded kernel_mul_mm_q2_K_f32                0x11f7b2be0\n",
      "ggml_metal_init: loaded kernel_mul_mm_q3_K_f32                0x11f7b30c0\n",
      "ggml_metal_init: loaded kernel_mul_mm_q4_K_f32                0x11f7b35a0\n",
      "ggml_metal_init: loaded kernel_mul_mm_q5_K_f32                0x11f7b3a80\n",
      "ggml_metal_init: loaded kernel_mul_mm_q6_K_f32                0x11f7b3f60\n",
      "ggml_metal_init: loaded kernel_rope                           0x11f7b41c0\n",
      "ggml_metal_init: loaded kernel_alibi_f32                      0x11f7b47c0\n",
      "ggml_metal_init: loaded kernel_cpy_f32_f16                    0x11f7b4d90\n",
      "ggml_metal_init: loaded kernel_cpy_f32_f32                    0x11f7b5360\n",
      "ggml_metal_init: loaded kernel_cpy_f16_f16                    0x11f7b5930\n",
      "ggml_metal_init: recommendedMaxWorkingSetSize = 21845.34 MB\n",
      "ggml_metal_init: hasUnifiedMemory             = true\n",
      "ggml_metal_init: maxTransferRate              = built-in GPU\n",
      "llama_new_context_with_model: compute buffer total size =  356.03 MB\n",
      "llama_new_context_with_model: max tensor size =    87.89 MB\n",
      "ggml_metal_add_buffer: allocated 'data            ' buffer, size =  6984.06 MB, ( 6984.50 / 21845.34)\n",
      "ggml_metal_add_buffer: allocated 'eval            ' buffer, size =     1.36 MB, ( 6985.86 / 21845.34)\n",
      "ggml_metal_add_buffer: allocated 'kv              ' buffer, size =  3048.88 MB, (10034.73 / 21845.34)\n",
      "ggml_metal_add_buffer: allocated 'alloc           ' buffer, size =   354.70 MB, (10389.44 / 21845.34)\n",
      "AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | \n"
     ]
    }
   ],
   "source": [
    "llm = LlamaCPP(\n",
    "    # You can pass in the URL to a GGML model to download it automatically\n",
    "    model_url=model_url,\n",
    "    # optionally, you can set the path to a pre-downloaded model instead of model_url\n",
    "    model_path=None,\n",
    "    temperature=0.1,\n",
    "    max_new_tokens=256,\n",
    "    # llama2 has a context window of 4096 tokens, but we set it lower to allow for some wiggle room\n",
    "    context_window=3900,\n",
    "    # kwargs to pass to __call__()\n",
    "    generate_kwargs={},\n",
    "    # kwargs to pass to __init__()\n",
    "    # set to at least 1 to use GPU\n",
    "    model_kwargs={\"n_gpu_layers\": 1},\n",
    "    # transform inputs into Llama2 format\n",
    "    messages_to_prompt=messages_to_prompt,\n",
    "    completion_to_prompt=completion_to_prompt,\n",
    "    verbose=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "445453b1",
   "metadata": {},
   "source": [
    "We can tell that the model is using `metal` due to the logging!"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5e2e6a78-7e5d-4915-bcbf-6087edb30276",
   "metadata": {},
   "source": [
    "## Start using our `LlamaCPP` LLM abstraction!\n",
    "\n",
    "We can simply use the `complete` method of our `LlamaCPP` LLM abstraction to generate completions given a prompt."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "5cfaf34c-0348-415e-98bb-83f782d64fe9",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  Of course, I'd be happy to help! Here's a short poem about cats and dogs:\n",
      "\n",
      "Cats and dogs, so different yet the same,\n",
      "Both furry friends, with their own special game.\n",
      "\n",
      "Cats purr and curl up tight,\n",
      "Dogs wag their tails with delight.\n",
      "\n",
      "Cats hunt mice with stealthy grace,\n",
      "Dogs chase after balls with joyful pace.\n",
      "\n",
      "But despite their differences, they share,\n",
      "A love for play and a love so fair.\n",
      "\n",
      "So here's to our feline and canine friends,\n",
      "Both equally dear, and both equally grand.\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "llama_print_timings:        load time =  1204.19 ms\n",
      "llama_print_timings:      sample time =   106.79 ms /   146 runs   (    0.73 ms per token,  1367.14 tokens per second)\n",
      "llama_print_timings: prompt eval time =  1204.14 ms /    81 tokens (   14.87 ms per token,    67.27 tokens per second)\n",
      "llama_print_timings:        eval time =  7468.88 ms /   145 runs   (   51.51 ms per token,    19.41 tokens per second)\n",
      "llama_print_timings:       total time =  8993.90 ms\n"
     ]
    }
   ],
   "source": [
    "response = llm.complete(\"Hello! Can you tell me a poem about cats and dogs?\")\n",
    "print(response.text)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9038f7d7",
   "metadata": {},
   "source": [
    "We can use the `stream_complete` endpoint to stream the response as it’s being generated rather than waiting for the entire response to be generated."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "7b059409-cd9d-4651-979c-03b3943e94af",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Llama.generate: prefix-match hit\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  Sure! Here's a poem about fast cars:\n",
      "\n",
      "Fast cars, sleek and strong\n",
      "Racing down the highway all day long\n",
      "Their engines purring smooth and sweet\n",
      "As they speed through the streets\n",
      "\n",
      "Their wheels grip the road with might\n",
      "As they take off like a shot in flight\n",
      "The wind rushes past with a roar\n",
      "As they leave all else behind\n",
      "\n",
      "With paint that shines like the sun\n",
      "And lines that curve like a dream\n",
      "They're a sight to behold, my son\n",
      "These fast cars, so sleek and serene\n",
      "\n",
      "So if you ever see one pass\n",
      "Don't be afraid to give a cheer\n",
      "For these machines of speed and grace\n",
      "Are truly something to admire and revere."
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "llama_print_timings:        load time =  1204.19 ms\n",
      "llama_print_timings:      sample time =   123.72 ms /   169 runs   (    0.73 ms per token,  1365.97 tokens per second)\n",
      "llama_print_timings: prompt eval time =   267.03 ms /    14 tokens (   19.07 ms per token,    52.43 tokens per second)\n",
      "llama_print_timings:        eval time =  8794.21 ms /   168 runs   (   52.35 ms per token,    19.10 tokens per second)\n",
      "llama_print_timings:       total time =  9485.38 ms\n"
     ]
    }
   ],
   "source": [
    "response_iter = llm.stream_complete(\"Can you write me a poem about fast cars?\")\n",
    "for response in response_iter:\n",
    "    print(response.delta, end=\"\", flush=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f7617600",
   "metadata": {},
   "source": [
    "## Query engine set up with LlamaCPP\n",
    "\n",
    "We can simply pass in the `LlamaCPP` LLM abstraction to the `LlamaIndex` query engine as usual:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "d4c6f564",
   "metadata": {},
   "outputs": [],
   "source": [
    "# use Huggingface embeddings\n",
    "embed_model = HuggingFaceEmbeddings(\n",
    "    model_name=\"sentence-transformers/all-mpnet-base-v2\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "eedcd31d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# create a service context\n",
    "service_context = ServiceContext.from_defaults(\n",
    "    llm=llm,\n",
    "    embed_model=embed_model,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "5d485f1e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# load documents\n",
    "documents = SimpleDirectoryReader(\n",
    "    \"../../../examples/paul_graham_essay/data\"\n",
    ").load_data()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "c55c33cd",
   "metadata": {},
   "outputs": [],
   "source": [
    "# create vector store index\n",
    "index = VectorStoreIndex.from_documents(documents, service_context=service_context)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "e07659c8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# set up query engine\n",
    "query_engine = index.as_query_engine()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "64e095c5",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Llama.generate: prefix-match hit\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  Based on the given context information, the author's childhood activities were writing short stories and programming. They wrote programs on punch cards using an early version of Fortran and later used a TRS-80 microcomputer to write simple games, a program to predict the height of model rockets, and a word processor that their father used to write at least one book.\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "llama_print_timings:        load time =  1204.19 ms\n",
      "llama_print_timings:      sample time =    56.13 ms /    80 runs   (    0.70 ms per token,  1425.21 tokens per second)\n",
      "llama_print_timings: prompt eval time = 65280.71 ms /  2272 tokens (   28.73 ms per token,    34.80 tokens per second)\n",
      "llama_print_timings:        eval time =  6877.38 ms /    79 runs   (   87.06 ms per token,    11.49 tokens per second)\n",
      "llama_print_timings:       total time = 72315.85 ms\n"
     ]
    }
   ],
   "source": [
    "response = query_engine.query(\"What did the author do growing up?\")\n",
    "print(response)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "llama-index",
   "language": "python",
   "name": "llama-index"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
